iSC.MEB: an R package for multi-sample spatial clustering analysis of spatial transcriptomics data

Abstract Summary Emerging spatially resolved transcriptomics (SRT) technologies are powerful in measuring gene expression profiles while retaining tissue spatial localization information and typically provide data from multiple tissue sections. We have previously developed the tool SC.MEB—an empirical Bayes approach for SRT data analysis using a hidden Markov random field. Here, we introduce an extension to SC.MEB, denoted as integrated spatial clustering with hidden Markov random field using empirical Bayes (iSC.MEB) that permits the users to simultaneously estimate the batch effect and perform spatial clustering for low-dimensional representations of multiple SRT datasets. We demonstrate that iSC.MEB can provide accurate cell/domain detection results using two SRT datasets. Availability and implementation iSC.MEB is implemented in an open-source R package, and source code is freely available at https://github.com/XiaoZhangryy/iSC.MEB. Documentation and vignettes are provided on our package website (https://xiaozhangryy.github.io/iSC.MEB/index.html). Supplementary information Supplementary data are available at Bioinformatics Advances online.

a Potts model, where δ is a Dirac function, C t (β t ) is a normalization constant, N ti is the neighborhood of spot i in tissue t, and β t is the smoothing parameter that controls the similarity among the neighboring labels, in other words, the degree of spatial smoothness. In addition, we assume a continuous multivariate HMRF for u ti to captures the spatial dependence in the embedding space. In detail, u ti is assumed to follows the CAR model (Besag 1974), where subscript −ti denotes all spots except spot i in tissue t, m ti is the number of neighborhoods of spot i in tissue t, µ u ti = m −1 ti j∈N ti u tj and Ψ t ∈ R q×q is the covariance matrix of u ti .
In the ICM step, we first estimate y ti by maximizing its posterior given u ti =û ti . By Bayesian formula, we obtain P(y ti |û ti ,ŷ −ti , v ti ) ∝ P(v ti |û ti , y ti )P(y ti |ŷ −ti ).
(2.2) Then, we estimate u ti by maximizing its posterior given y ti =ŷ ti . The Bayesian formula shows that Thus, the updated equation for u ti iŝ Finally, we repeat equation (2.2) and (2.4) until converge.
In the expectation (E) step, by the pseudo observed log-likelihood (1.7), we can derive its evidence lower bound (ELBO) is where expectation E θ (s) is taken with respect to (y ti , z ti , u ti ) givenŷ −ti ,û −ti , v ti and parameters θ (s) . By Bayes' Formula, we have .
Since (z ti , u ti ) givenŷ −ti ,û −ti , v ti , θ (s) is a multivariate normal distribution, using the formula about the conditional expectation and covariance of multivariate normal distribution, we have Taking derivatives of Q(θ) with respect to the parameter θ and setting them to zero, we obtain the updated equations in the maximization (M) step: Since there is no closed-form solution for β t , we update the smoothness parameter β t via a grid search strategy: The ICM-EM algorithm iterates the ICM step and M step until convergence.

MBIC criteria for determining K
In the implementation, iSC.MEB is implemented on a sequence of K, and then using MBIC (Wang et al. 2009) to determine the number of clusters K, which is given by where df K is the degree of freedom and C n is a positive constant that can depend on n and q. Following Ma & Huang (2017), we let C n = c ln(ln(n + q)), where c is a positive constant default as 1. Since the observed log-likelihood ln P(V ;θ K ) is intractable, we use the pseudo observed log-likelihood (1.7) to approximate it.

Methods for comparison
We conducted real data analysis to compare SE-MEB2 with existing methods of data integration and clustering. The following methods are considered as benchmarks of perfor- In the analysis, we treated the manual annotations as the ground truth to evaluate the clustering and data integration performance of the different methods. In the implementation, iSC.MEB and SC.MEB are implemented on a sequence of K, and then using MBIC (Wang et al. 2009) to determine the optimal number of clusters in a data-driven manner.
Louvain uses a community-modularity maximizing rule to choose the number of clusters. All other methods are implemented using the same number of clusters/domains as the manual annotations.
We evaluated the methods' performances by adjusted rand index (ARI, Hubert & Arabie (1985)) and normalized mutual information (NMI, Cover & Thomas (2006)). Both metrics are used for compares the overlap of two clusterings. ARI lies between −1 and 1, and a higher value means a higher degree of similarity between two partitions. NMI takes values from 0 to 1, and a NMI value of 1 indicate a perfect match. In addition, to evaluate the data integration clustering performance, we calculated ARI and NMI by comparing manual annotations against cluster labels on all samples, calling them integrated ARI (IARI) and integrated NMI (INMI), respectively.

Human dorsolateral prefrontal cortex Visium data
Recently, Maynard et al. (2021) generates the spatial topography of gene expression from 12 human postmortem DLPFC tissue sections by using the 10x Genomics Visium platform.
The expression count matrix contained 47,681 spatial locations, with 33,538 genes for each spot. The manual annotations of the layers based on the cytoarchitecture are also provided, which allowed us to evaluate the performance of both the data integration and accuracy of spatial domain detection by taking the manual annotations as ground truth. We first performed log-normalization of all datasets, and then obtain 15 top principal components (PCs) from 2,000 most highly variable genes. Finally, we performed clustering analysis for all methods. Figure 1 shows the ARI values, NMI scores and running time for 12 DLPFC samples.
iSC.MEB provided higher median of ARI and NMI values than all other methods, indicates its superior clustering performance. In addition, Figure 1 indicates that multi-sample analysis methods, iSC.MEB and BASS, gain an advantage from integration, and provide significantly better accuracy than single tissue section analysis methods. iSC.MEB, compared to BASS, had a higher ARI on 8 samples while having a comparable NMI. And in Figure 1, we can see that the computation time of iSC.MEB is significantly more efficient than BASS and BayesSpace. Louvain and SpaGCN were faster, but their clustering accuracy is inferior.
Furthermore, as shown in Table 1, the IARI and INMI were (0.48, 0.57) for iSC.MEB and (0.43, 0.53) for BASS, which indicates that iSC.MEB has better data integration performance than BASS. Note that BASS was performed with the 'true' number of clusters from manual annotations, iSC.MEB achieves better performance without this prior information.  3.2 Mouse embryo seqFISH data Lohoff et al. (2020) provided the high-resolution spatial map from 6 mouse embryo tissue sections a modified version of the seqFISH (sequential fluorescence in situ hybridization) method which allow highly-effective cell segmentation. The expression count matrix contained 387 selected target genes for each mRNA spot, and after cell-level quality control, we obtain a total of 51,800 spatial locations. Its annotation procedure first perform joint analysis of seqFISH datasets and an existing scRNA-seq atlas (Pijuan-Sala et al. 2019) by constructing a shared nearest neighbor network on PCs followed by Louvain network clustering. Then, use the same method to find subclusters within each cluster. Finally, some subclusters are manual re-annotated based on the expression of marker genes and the relative contribution of cells to each subcluster. In the analysis, we treated these manual annotations provided by Lohoff et al. (2020) as the ground truth to evaluate the clustering and data integration performance of the different methods. We first obtain 15 top PCs using a similar approach to what we did for the DLPFC dataset. Then, the integrate spatial clustering performed by iSC.MEB was compared with that of other methods except Louvain since Louvain was used for the annotation process of cell types.
The ARI values, NMI scores and running time obtained using different methods are shown in figure 2. iSC.MEB clearly outperform other methods in both ARI and NMI scores.
The runtime results demonstrated similar patterns to those obtained on DLPFC dataset.